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ABSTRACT 

A major problem in achieving significant speed-up on parallel machines is the overhead 
involved with synchronizing the concurrent processes. Removing the synchronization con- 
straint has the potential of speeding up the computation. We present asynchronous (AS) 
and corrected-asynchronous (CA) finite difference schemes for the multi-dimensional heat 
equation. Although our discussion concentrates on the Euler scheme for the solution of the 
heat equation, it has the potential of being extended to other schemes and other parabolic 
PDEs. These schemes are analyzed and implemented on the shared-memory multi-user 
Sequent Balance machine. Numerical results for one and two dimensional problems are pre- 
sented. It is shown experimentally that synchronization penalty can be about 50% of run 
time: in most cases, the asynchronous scheme runs twice as fast as the parallel synchronous 
scheme. In general, the efficiency of the parallel schemes increases with processor load, with 
the time-level, and with the problem dimension. The efficiency of the AS may reach 90% 
and over, but it provides accurate results only for steady-state values. The CA, on the other 
hand, is less efficient but provides more accurate results for intermediate (non steady-state) 
values. 
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1 Introduction 


As parallel machines become more popular, most algorithms used on parallel machines 
still rely heavily on synchronizing the concurrent processes. There is an inherent inefficiency 
in the synchronization requirement, which is two-fold. First, a fast processor is delayed until 
the slowest processor finishes. Thus, the pace of the algorithm is dictated by the slowest 
processor. There are various reasons why certain processors will be ahead of the others, even 
when they are physically configured at the same speed. 

• Random noise 

Any unpredictable perturbation that can cause a momentary delay. 

• Multi-user environment 

In such an environment many tasks are simultaneously assigned to each processor. 
Since a single processor can be used for any of those tasks at any particular time, we 
cannot predict how much will be devoted for the computation of our algorithm. 

• Master/Slave 

In such an environment one processor, the master , is doing additional tasks to those 
performed by the other processors for example, i/o operations, scheduling or load 
balancing tasks. Therefore, it may be slow in performing our algorithm. 

• Load Balancing 

In many problems it may be logical to divide the problem unequally among the proces- 
sors due to different boundary conditions or approximation schemes that are used. In 
general the partition of nodes among processors is dictated by numerical and physical 
considerations rather than just by computer architecture considerations. This extends 
the duration of an iteration for those processors that are assigned to do more work and 
therefore those would be slower than the others. 

Second, there is a delay period associated with the synchronization mechanism itself 
whether it is setting the semaphores in a shared memory environment or waiting on a message 
to arrive in a message-passing environment, or any other possible implementation (i.e. when 
possible, setting a time bound on the duration of an iteration, so that the next iteration 
starts only after this time bound is exhausted). No matter what implementation is used, 
a slow communication channel slows the progress of the entire computation. Moreover, in 
some situations synchronization may cause contentions over communication resources and 
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memory access, that require careful implementation using additional mechanisms such as 
locking. 

A particular case where tasks are to be repeated is that of iterative computation. Numer- 
ical properties of iterative solutions to PDEs are usually based on the assumption that the 
iterations are synchronized. This is equivalent to assuming that the algorithm is governed 
by a global clock so that the start of each iteration is simultaneous for all processors. In 
implementing a synchronous algorithm in an inherently asynchronous architecture a synchro- 
nization mechanism is used in order to guarantee the correct execution. This considerably 
degrades the efficiency of those algorithms. An asynchronous algorithm can potentially re- 
duce the synchronization penalty since each processor can execute more iterations when it 
is not constrained to wait for the most recent results of the computation in other processors. 
In addition, asynchronous algorithms eliminate the programming efforts involved in setting 
up and debugging the synchronization mechanism and also simplify the task management. 

The usage of asynchronous iterations for an iterative solution of systems of linear equa- 
tions, is due to [6]. Asynchronous Iterative Methods for Multiprocessors are also discussed 
in [4], and are used for the solution of ordinary differential equations by [11]. In this paper, 
we present an asynchronous iterative methods for the solution of partial differential equa- 
tions. These methods are based on Euler explicit finite difference schemes. In Section 2 we 
present the parabolic PDE which will serve as our model problem, the corresponding finite 
difference scheme and its asynchronous parallel modification. Section 3 analyzes our asyn- 
chronous scheme and determines the conditions under which the asynchronous iterations 
work. To compensate for the inaccuracy of non steady-state values of this asynchronous 
scheme, we propose and discuss in Section 4 the corrected-asynchronous scheme, that while 
still being asynchronous, performs some extra extrapolation calculations. Numerical results 
are presented in Section 5. Finally, our results are summarized in Section 6. 

2 The Problem and its Asynchronous Solution 

We demonstrate our approach for asynchronous solution on the multi dimensional heat 
equation. The same approach can be extended and generalized to other types of problems. 
For our model problem we consider the simplest parabolic equation, the heat equation in 
d- variables, 

du _ 2 d 2 u 

dt fr' 1 Cls dx 3 2 

in a rectangular domain, with accompanying initial and boundary conditions, where a 2 s 
(1 < s < d) are constant positive coefficients. For our model problem we consider Dirichlet 
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Figure 1: Model problem - Domain discretization 


boundary conditions. However, other conditions are equally applicable. A one dimensional 
model problem is, 

du 2 d 2 u . 

~Ql = a i fa 2 for 0 < x < 1 0 <t <T 

with initial condition, 

u(x,0) = /(x) 0 < x < 1 

and Dirichlet boundary conditions, 


u (M) = Si(0 0 < t < T 

u (M) = s 2 (0 o<t<r 

where aj is a constant positive coefficient. 

After discretizing the domain (see Fig. 1) we approximate Eq. (1) at grid-point (x,-,f n ) 
by the forward Euler finite difference approximation: 


( 2 ) 


,n+l _ 


= v i + Yl 


r,S a v T t 


where, r s = 1 < a < d are held as constants; i (1 < i < p) denotes the index of 

the spatial coordinate x of a particular grid point; d denotes the dimension of the problem; 
and the operator 6 2 denotes a central difference in the direction of s. 

The one dimensional version of this scheme is, 


u" +1 = + r(v"_ x - 2u" + v ? +1 ) 

where, r = is held constant, and the values of vf (1 < i < p), v %, and i£ +1 are 

determined by the initial and boundary conditions, respectively. 
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It is well known that the scheme of Eq. (2) is stable if 

(3) 0<£r,<i 

S ** 

Although the following discussion concentrates on the Euler scheme for the solution of the 
heat equation, it has the potential of being extended to other schemes and other equations. 
In this paper we only consider the simple scheme (2). Extensions to other explicit schemes 
are straightforward. We emphasize in this study the analysis of asynchronous schemes. 
Comparisons for ADI schemes will be performed in a work in preparation. 


2.1 Asynchronous (AS) Model 

We now present the model of asynchronous iterations to the numerical problem described. 
The definition of chaotic iteration was presented in [6]. The formal definition of asynchronous 
iteration presented below is a modified version of what is discussed in [4], [7], and [11]. 


Definition 1 Let F t - be a computational task to be done on the i th component of a given 
vector u, taking as its arguments components of u in the neighborhood of i. Asynchronous 
iterations ( F, u, J, A) corresponding to these tasks and starting with a given vector, u° , are 
a sequence of vectors defined recursively by: 


(4) 


< +1 = 


u i 

Fi{ 


u 


n,+ii(l,t,n) n+d(2,i,n) 
i u 2 


,U L 


if i J n 

n+d(Ll,>) ) if ieJn 


where, n is a global value of an iteration level, u = (u x , . . . ,u L ), F = (Fi, . . . , F L ) , J = 
{ J n \n = 1, 2, . . .} is a sequence of non-empty subsets of {1, . . . ,X}, A = {A*, . . . , Aif] , and 
Ai is a sequence of elements in 7 L l , A ,• = {(d(l,i,n),. . . ,d(L,i,n))\n =1,2,...} Vi — 
1 


No assumptions are made on the relations between the calculations of the grid points, 
except a non-starvation condition which guarantees that the i th grid point is updated an 
infinite number of times, i.e i occurs infinitely often in the sets J„. This means that no point 
is abandoned forever, and consequently, no processor is held forever executing the same 
iteration. Synchronous iterations are obtained from this model when d(k,i,n ) = 0 VA;,f,n. 

A single or several grid points may be assigned to each processor of a MIMD machine with 
p asynchronous processors. For simplicity also assume that values stored by each processor 
are available to the rest by means of a shared memory. Yet, with some minor modifications 
this model is equally applicable to message-passing architectures, where in fact, the overhead 
of synchronization is more significant. 
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Let us now take the local point of view and consider the i th grid point which is assigned, 
perhaps with some of its neighbors, to a specific processor. We present the following modified 
asynchronous difference equation for the multi-dimensional heat equation: 


( 5 ) 


u 


n.+l _ 


= u ? + 'Z2 a=1 r ° 6 < 




Here, n,- denotes the last completed iteration at the i th grid point and 6 2 u?‘ is a central 
difference of the value of u,- at the n\ h iteration, using the most recent available values of the 
neighboring points. In the s th coordinate direction the neighboring values are denoted by 
u f _ x and u i+la ‘ , respectively. 

The values of a”‘ and /?£* correspond to the delay or advancement of the iteration number 
of the neighboring grid point along the s-coordinate axis, relative to the i th grid point, when 
evaluating its (n* + l) t/l iteration. 

For example, in one dimension we obtain 


,n,+i 




n, . „ ni+0"' ■ 


where, r = 

For our analysis we require a and 0 to be bounded, i.e. no node falls infinitely behind 
its neighbors. We note that for each iteration, a and 0 are functions of i, the number of the 
processor, and not of x. Hence as we add more intermediate points, the difference between 
the adjacent iteration levels does not approach zero. 


3 Analysis of the Asynchronous Scheme 


Lemma 1 When a and 0 are bounded, the asynchronous finite difference approximation Eq. 
(5) is consistent with the following heat equation: 


( 6 ) 


where 


, I<(x,t ) = 


^ = /C(f,()V.[Vu] 

a?Al 


, and r s = 0} are constants. 


-E: =1 r.[«.(*.*)+A(*.*)l' - (A*,F 

Proof: Taylor series expansion of Eq. (5) yields, 

= „”i + ^ + [£‘ =i r.(cC + - “H 


. (7) 

• Thus, 


original scheme 


perturbation 

+0[(At) 2 + Ef =1 (At)(A^)] 


M _ v* r r a rn + 0 n ')] u d-~ !fil = y d S^±- 


(Az s ) 2 
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approximates Eq. (6) with truncation error of 0[(At) + £f=i(Axj)]. a 

The coefficient, K, in Eq. (6) depends on a s and ffi and thus can become negative. It 
is well-known that the multi-dimensional heat equation Eq. (6) is well-posed for a positive 
coefficient, K > 0, and can be ill-posed for a negative coefficient. Thus, a sufficient condition 
for well-posedness is 1 — J2s r s(&s+0s) > 0. Consequently, for the special case where r s = r Vs 
and r < A we obtain, ]C«( a a + ffi) < 2d. However, this leads to a severe restriction on a and 
/ 3 so the scheme is not completely asynchronous. A weaker condition for the well-posedness 
of Eq. (6) can be set, requiring positiveness only in some average sense. 

Lemma 2 A sufficient condition for the well-posedness of Eq. (6) for large times when 

I<i(t)<K{x,t)<K 2 (t) Vx 

is 

[ Ki(£)d£ >0 for < large 
J o 

Proof: Assume K(x,t ) satisfies, K\(t) < K(x,t ) < K 2 {t) Vx. Using separation of 
variables we obtain a general solution to the equation v t = Ki(t)W • [Vu] : 

v (x,t) = XT=i A ne~ Xnm <Pn(x) 

where, P(t) = /o Ai(£)d£ and V • [V(/5 n ] + A <p n = 0. <p n {x) are the appropriate eigenfunctions 
of the steady-state equation with eigenvalue A„ > 0. Accordingly, v t = • [Vu] has a 

solution 

v(x,t) = ^^ =1 P n e' An ^° K2( ° d Vn(^) 

For the one- dimensional case we obtain, A n = n 2 > 0 ; <p n {x) = sin(mcx) and for two 
dimensions, A n = n 2 + m 2 > 0 ; (p n (x,y^ = sin(n7rx)sin(m7rt/). 

For large times only the first mode <pi(x) is important. In this case the equation is 
reducible to an ordinary differential equation and hence 

e~*fo K2 ^ d tip(x) < u(x,t ) < e~*f° Kl ^) dt £<p(x) 

Thus, sufficient conditions for u to converge to a steady-state are: 

1. P(t) > —6 Vt for some 8 

2. A n = o(e -An5 ) n — > oo 

3. lim i _ >co P(t) = oo 
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A possible way of achieving convergence is restricting K > 0 at the beginning. This ensures 
that P(t ) is initially positive (see Fig. 2). If cond. 2 is not true and P(t ) is negative for 
some t, then 

V • [Vv] = -£: =i AM n e- ( V(f) 

may not converge. To enforce cond. 2 we can start synchronously and only later let the 
processors run asynchronously. Another possibility in case P(t ) becomes negative, P(t) > 
—6 Vi, is to filter the initial conditions, to damp the high frequencies 

u(x,0) = f(x ) = ^A n y n {x) 

and A n satisfies cond. 2, e.g A n = A n e~ xn for n > N. § 


Lemma 3 The asynchronous finite difference scheme Eq. (5) is stable for 

d 1 

o <£>,<- 

i=i z 

Proof: Assume that the iteration of the i th grid point had been completed and its next 
iteration is currently being performed (see Fig. 3). Since (3) is valid, then the absolute 
values of the coefficients of u values on the right hand side of Eq. (5) sum to 1. 

K' +1 | < max{|u”*|, |i£t> |, |Cf‘* I |1 <s<d} 

Using the last inequality recursively by backtracking the origins of each relevant grid point, 
we finally reach the initial values, thus 

|<' +l | < maxylujl 


By lemma 1, lemma 3, and the Lax Equivalence Theorem the scheme is convergent in 
the maximum norm. (We also note that this proof applies to any scheme for which all the 
coefficients are between 0 and 1. For parabolic type of equations, higher order schemes can 
be constructed with this property.) Hence, 
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Figure 3: Stability analysis - processors status 


Lemma 4 The asynchronous finite difference scheme is convergent to Eq. (6), under the 
same restrictions as Lemma 3. 


This approach for proving the convergence of the AS scheme was separately derived and 
used by the third author in the paper [1], and it was presented in the [10]. 

Assume now that the solution of the asynchronous scheme in Eq. (5) had been completely 
evaluated up to a time level T = NAt, and it is now interpolated into a smooth function 
w(x, t ) over the entire region under discussion. We denote by v the discrete solution of the 
synchronous solution of Eq. (2) and assume that both w and v are satisfying the required 
boundary and initial conditions. Let ef be the difference between them at an interior grid 
point Pi for a global value of n; 

(8) «? = «-;- v? 


and let, 


£<") = 


/ e" \ 


V e? ) 


be the corresponding difference vector, in some specific order, associated with all interior 
spatial grid points Pi (1 < i < I) at the same global time level t = nAt (n < N). In 
addition, let 

' ' ' 1 - ELi r.« + PI) 

where < and /?£ are the corresponding delays or advancements d(k, i, n ) in Eq. (4) of the 
neighbors of the i th grid point along the s-coordinate axis, while ie" +1 had been evaluated. 
Then we can prove the following lemma which bounds e", given a specific grid spacing. 
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Lemma 5 If a and ft are bounded by M and 

(10) 0< 7t n <i Vi.n 

£ 

then for nAt < T 

(11) Halloo < M£max|vf +1 - vf\ + T • 0{(At) + £(A*.)] 

k=0 ' 3 

Remark 1 The last inequality should be read as follows: there exists a constant R ( which 
is, in fact, the bound on the second-order derivative of w(x,t)) such that 

||i<" +1 >|| < M •£ max |»f+i _ „*| + R . T[(At) + £(Ax,)] 

k = 0 * , 

for all sufficiently small At and Ax s (1 < s < d). Note that R depends on w(x,tf which in 
turn depends on the mesh spacing. 


Proof: Using the Taylor Theorem of the Mean and neglecting terms of 0[(At) 2 +E s (Ax s )( At)] 
we obtain from Eqs. (5) and (8), 


( 12 ) 


[i-SJ=.r.(oj + «;)K 


n+1 


= Ef=ir,(e?_ 1# +£? f u)+ 

+ [! ~ 2 ELi r s - Ef=i^« 3 + /?,")]£• 
+ELi rs(*z+Pim + '-v?) 


in which ls and £" +lj denote the difference values at the neighbors of the i th grid point 
along the s-coordinate axis. In a matrix form one obtains, 


(13) £< n+1 ) = yl( n )£< n ) + gi n ) 

In Eq. (13) is a (2d + l)-diagonal matrix with main diagonal terms of the form 

1 - rUi’-Ja 1 ;, + A") 

and d additional diagonals on each side with terms of the form 


1 - E^i r *« + ft.) 

while is a vector whose components are given by 


1 < s < d 


(14) 


g n - + Pi.) ( n+1 _ n\ 

* 1 -EL^kW' v{) 
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Following is an example of a 9 x 9 matrix, corresponding in some specific order to a 3 x 3 
interior grid, for the two dimensional problem: 


/ izii 


‘'i 
r 2.- 
1 — vi 


1 


l-v\ 

kzh. 
1—1/2 
12- 
1 — 1/3 


1-1/5 


1 — 1*2 
l=£a. 
1 - 1/3 


1-1/6 


1 - 1/1 


l-J / 4 

1-^5 


1 - 1/7 


\ 


V 


1 - 1*2 


r 2 

I-I/4 

i=ii 

1 - 1/5 

1 - 1/6 


l-t/8 


_ri 


1-1/3 

_n 


r? 

1-1/4 

r i 

1-p 

1-1/6 


1-1/5 

r ? 


1-1/7 

1-1/7 

i-<« 


1-1/8 

1 — 1/8 
**? 

1-1/9 


1-1/9 

(ri + r 2 ) + 

j/,-, and 


1 — 1/6 


1 — 1/8 
izfc. 

1 - 1/9 


in which 1 + /?£) + r 2 (a£ + /?£), & = 2(ri + r 2 ) + Vi, and 

t*\ t 


£<") = 


-n 

£ 2 

£ 3 


\ ^9 / 


£ ?1 \ 
e? 2 

£ ?3 


'21 

-n 

'22 


'32 


\ £ 33 / 


where £?■ is the error of the grid point at the i th row and the j th column at the n th time level. 

It follows from Eq. (10) that all the elements of A ^ are positive and that H-A^Hoo = 1. 
Since e<°) = 0, we obtain from Eq. (13) that 


(15) 

In addition, 


|e <n+I, IU < £ llj“ , ll< 


k=0 


, EjLi *.« + ft”) , 


so that, 


< 


7” 


E’-.K + A”)l2 2M 7 r 


'1-EL.r.K+ff.) ELir.'Sl 

Then (11) follows from the last inequality, from Eq. (10) and from Eq. (15). 1 
Note from Eq. (12) that in the steady-state case when v" +1 — u” — > 0, ||e^| 
increase as n — * 00 , neglecting terms of 0[(At) 2 + £,(Ax,)(A<)]. 


does not 


4 Corrected- Asynchronous (CA) Scheme 

We have shown that the asynchronous iterations are consistent with a PDE that is 
different than the original one. Consequently, time accuracy is lost. This suggests that if we 
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apply a correction during each iteration we can improve our approximation, without requiring 
explicit synchronization. For each processor we now require an extra variable that will serve 
as an iteration counter for that processor. The correction we apply is an extrapolation based 
on those variables. 

Hence, we construct a modified equation whose asynchronous approximation is time 
consistent with the original Eq. (1) by subtracting the perturbation term of Eq. (7), 

<* +1 = + ElMV - («? +1 - + w) 

where, 8 2 denotes a central difference in the space coordinates, with each point in its own 
completed time level. Thus, our Corrected- Asynchronous (CA) scheme is: 


(16) 

provided that, 
(17) 


u, n>+1 = u? + 


i + Ej = 1 r.(a?; + A"')S r " 5 *“‘" 


1 + + A n ;) ? o 


Lemma 6 //<• and fif are bounded Vn,i, s then the Corrected- Asynchronous approxima- 
tion Eq. (16) is consistent with Eq. (1). 

Proof: Using the Taylor Theorem of the Mean we have, 

<5X ,+1 = W + K‘ + A n ;)« ,+1 - <•') + 0{Ax a ){At) + 0(At 2 ) 

Substituting in Eq. (16) we obtain, 

< i+1 = + E r > 8 1 u T + E 0(Ax s )0(At ) + 0(At 2 ) 


a=l 


J=1 


or. 


u? +1 -u? ^ 8 2 uV d 


At ^ (Ax s ) 2 -E 0(Aa: 4 ) + O(At) 
Hence the scheme of Eq. (16) is consistent with Eq. (1). B 


Lemma 7 If 
(18) 


0 < 


E s r a 


<1 


l + + /?";) - 2 

then the CA scheme of Eq. (16) is stable. 


11 



Proof: Analogous to the proof of lemma 3. ■ 

By the Lax Equivalence Theorem, lemma 6 and lemma 7 imply that the CA scheme is 
convergent, provided that condition (18) is valid. 

Condition (18) can be interpreted either as restrictions on a and /?, or as a bound on r s . 
Practically, the bounds on a and /? are determined by various factors as discussed earlier. 
Consequently, they impose the above restriction on r 3 . 

Finally, similar to lemma 5, we now show that at a given point of a specific mesh spacing, 
the difference between the solutions of Eq. (16) and Eq. (2) is bounded by 0[(At )+]C s (Ax.,)], 
provided that condition (3) is valid. 

Lemma 8 Let 

% = - v? 

where th" is the smooth interpolation of the CA scheme, Eq. (16), solution and vf is the 
solution of the synchronous scheme Eq. (2) both with the same initial/boundary conditions. 
In addition, let 

6 {n) = 

be the corresponding difference vector, in some specific order, with all interior spatial grid 
points Pi (L<i< I) at the same global time t = nAt (n < N). If (3) is valid then for 
nAt < T 

(19) ||£ (n) ||oo <r.O[(Ai) + £(Ax s )] 

s 

provided that (17) holds. 

Proof: Along the same lines of the proof of lemma 5 we can show from Eq. (7) that w J* 
satisfies Eq. (2) with an error of 0[{At ) 2 + £ s (Ax s )(Af)]. Hence, if (3) is valid then 

||£ (n+1) ||oo < P (n) ||oo + 0[{At) 2 + £(Ax s )(A t)] 

s 

Since Halloo = 0, by using the inequality recursively, we obtain (19). 

l 

5 Numerical Results 

We have implemented the above algorithms on the shared-memory multi-user Sequent 
Balance [15]. The Sequent systems are commercial multiprocessors, that incorporate identical 
general-purpose 32-bit microprocessors and a single common memory. Each processor can 
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execute both user and kernel code. All processors share a single pool of memory, in addition 
to their own local memory, to enhance resource sharing and communication among different 
processes. Processors, memory modules, and i/o controllers are all plugged into a high-speed 
bus. Thus, the Sequent systems fit our model of an asynchronous multi-processor with the 
ability to exchange data between its processors via shared memory. 

The Sequent systems run the DYNIX operating system, a version of UNIX that supports 
a multi-user environment. Therefore the overhead of a single operation cannot be predicted. 
For example, task creation on another processor involves allocation of a free processor, 
allocation of memory and free entries in system tables, re-mapping, and finally context 
switching. The duration of allocations of both processors and memory is unpredictable 
as it is effected by numerous factors, although context switching itself takes only a few 
hundred machine cycles. As such, the Sequent is an example of a machine where no a priori 
assumptions can be made regarding relative speeds of processing. 

We investigated a serial (SR) implementation of Eq. (2), a parallel synchronous (SY) ver- 
sion of Eq. (2), an asynchronous (AS) implementation Eq. (5), and corrected-asynchronous 
(CA) implementation Eq. (16). 

In our synchronous (SY) version we delay all processors until the last one finishes the 
current iteration using a barrier. In this implementation every processor increments a counter 
upon finishing the current iteration. While checking the counter it can determine whether 
it is the last to finish. If it is, it changes a special hardware managed variable to inform the 
other processors that the iteration is finished. If it is not, it waits for a signal from the last 
processor. For other possible implementations see [8] and [9]. 

Note also that our task was simplified since each processor updates only his own specific 
values, although it may read some other values. Therefore, there is no requirement for 
simultaneous access to update global data. A special mechanism known as locking can be 
used in order to prevent such access in cases where the algorithm permits it. 

The following parameters were considered: 

• Ax, Ay - represent the discretization lengths along the x-axis and the y-axis, respec- 
tively. 

• At - represents the discretization length along the t-axis. 

• T - The time level to be reached for the given problem. We seek the solutions for 
u(x,t) where t <T = NAt . 

• p - The number of processing elements used for the calculation. 

• L - The number of ‘ 4 >1 grid points in the considered domain. 


13 


• r - The clock time elapsed before a specific algorithm completed the calculation of the 
given problem. 

• e - The maximal error magnitude 

• E - The maximal absolute relative error of the numerical result (relative to the analytic 
solution). 

• scheme - Three types of parallel schemes were used: Eq. (2) (SY - Synchronous), Eq. 
(5) ( AS - Asynchronous), Eq. (7) (CA - Corrected Asynchronous), and the serial (SR) 
version of Eq. (2). 

We examined both a one dimensional problem, as well as a two dimensional problem. 

The one dimensional problem considered was 

du d 2 u 
dt dx 2 

with the initial condition, 

u(x,0) = sin(7rx) 

and the Dirichlet boundary conditions, 


u(O.f) = u(l,t) = 0. 


The calculated results were compared ^’Hh the analytic result given by 

u(x,t) = sin(7rx)e -7r *. 


For the two-dimensional problem we examined the following non-homogeneous problem 

du d 2 u d 2 u 


where, 

with the initial condition, 


F = 57T 2 sin(7rx) sin(27ry) 
u(x,y,0) = 0 


and the boundary conditions: 

u(0,y,t) = 0 u(\,y,t) = 0 u(x,0,t) = 0 «(a,l,i) = 0 

The calculated results were compared with the analytic result given, for the steady-state, by 


Uoo = sin(7rx) sin(27ry). 
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T 

= 2( 

e = 10 

-8) 

T 

= 3( 

e = 10 

- 9 ) 

T 

= 4( 

T17 

- 16 ) 




CA 

P 

SY 

AS 

CA 

P 

SY 

AS 

CA 

10 

0.08 

0.1 

0.09 

10 

0.11 

0.15 

0.13 

10 

0.13 

0.19 

0.15 

11 

0.1 

0.12 

0.11 

11 

0.12 

0.16 

K2H 

11 

0.14 

0.2 

0.16 

12 

0.09 

0.12 

0.1 

12 

0.11 

itlfcl 

0.15 

12 

0.13 

0.21 

0.17 

13 

0.09 

0.12 

0.11 

13 

0.12 

0.18 

0.15 

13 

0.15 

0.22 

0.18 

14 

0.09 

0.13 

0.12 

14 

0.12 

0.19 

0.16 

14 

0.15 

0.23 

0.18 

15 

0.1 

0.14 

0.12 

15 

0.12 

0.21 

0.16 

15 

0.15 

0.26 

0.19 

16 

0.1 

0.15 

0.12 

16 

0.12 

0.2 

0.16 

16 

0.12 

0.25 

tarn 


Table 1: Efficiency for the steady-state one dimensional problem with one grid point per 
processor; L = p, Ax = = 0.5 

we also considered the homogeneous non steady-state problem 

du d 2 u d 2 u 

dt dx 2 "** dy 2 

with the initial condition, 

u(x,?/,0) = COs(7rx) COs(7T?/) 

and the following Dirichlet boundary conditions: 

u(0,y,t) = cos(7ry)e~ 2,r2< 

u(l,y,t) =-cos(7 xy)e~ 2 ^ t 
u(x,0,t) = cos(7rx)e _2,r2t 
u{x,l,t) — — cos(7rx)e~ 2,r2 *. 

The calculated results were compared with the analytic solution given by 

u(x,y,t ) = cos(7rx) cos(7ry)e -27r2 ' 

Performance results are shown in Table 1 for the one dimensional problem with one grid 
point per processor, and in Tables 2 and 3 for the one dimensional problem with several 
points per processor, all for the steady-state problem. In Table 4 we present results for 
the two-dimensional non-homogeneous steady-state problem. Non steady-state results are 
shown in Table 5 for the one dimensional problem with several points per processor, and in 
Table 6 for the two-dimensional problem. 

Table 1 presents the efficiency, which is ' i by: 

Efficiency = — 

P ' T scheme \P j ‘ ) 
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p 

T = 2 (e = 10- 9 ) 
SY AS CA 

T = 3 (e = 10~ 13 ) 
SY AS CA 

T = 4 (e = 10- 17 ) 
SY AS CA 

2 

0.78 

0.96 

0.53 

0.79 

0.96 

0.53 

0.79 

0.96 

0.53 

3 

0.76 

0.95 

0.52 

0.77 

0.96 

0.52 

0.77 

0.96 

0.52 

4 

0.74 

0.95 

0.52 

0.75 

0.96 

0.52 

0.74 

0.96 

0.52 

6 

0.66 

0.91 

0.51 

0.68 

0.93 

0.51 

0.66 

0.93 

0.51 

8 

0.6 

0.87 

0.49 

0.62 

0.89 

0.5 

0.62 

0.9 

0.5 

12 

0.49 

0.77 

0.49 

0.5 

0.8 

0.49 

0.5 

0.82 

0.51 

16 

0.37 

0.67 

0.41 

0.4 

0.73 

0.44 

0.4 

0.75 

0.44 


Table 2: Efficiency for the steady-state one dimensional problem with j points per processor; 
L = 48, Ax = 0.02128, = 0.5 

where tsr(1,L ) is the run time of the serial version, solving with its single processor a 
problem of L grid points, and r sc ^ eme (p, L) is the run time of the above parallel schemes, 
solving the same problem with p processors. 

We observe from Table 1 that for several processors (more than 10) the efficiency is 10% 
- 25 %. We emphasize, however, that much better preformance is obtained when several grid 
points are assigned to each processor, as shown in Table 2. In this case, the efficiency of the 
AS scheme reaches about 90% and more, for small number of processors, while the efficiency 
of the CA scheme is about 50%. It decreases slightly for the CA, and more significantly for 
the AS, as the number of processors increases. For large number of processors, the efficiency 
of CA is slightly higher than that of the SY. In most cases, as T increases, the efficiency 
of all the parallel algorithms slightly improves. In any case, the AS is the most efficient 
scheme to compute a fixed number of time steps. The synchronization penalty is indicated 
clearly by Table 3, which shows the run time of the AS and CA schemes relative to the SY 
run time. As expected, the AS scheme proves to be the fastest scheme. In most cases it 
is almost twice as fast as the SY scheme. The CA scheme is faster than the SY only for 
large number of processors, and even then, as was mentioned above, its efficiency is about 
the same as that of the SY. In general, as the number of processors increases, there are 
more independent entities to synchronize, and there is a better chance for one to delay the 
rest. Yet, the effect of this increase is not dramatic if all processors are hardware identical 
and are synchronized per iteration and therefore they are executing more or less the same 
number of iterations. Note that, the overall system lo. ' as a direct impact on the execution 
of sophisticated system tasks such as synchronizatio; . vices, in multi-users environments 
such as the Sequent system. Thus, we may see significant delays as the system is loaded 
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with other tasks. This increases the run time of the SY scheme, as the number of processors 
increases and thereby decreases the relative run time of AS and CA, as observed in Table 3. 
However, these delays also cause loss of accuracy of the AS and the CA schemes. 


p 



T = 4 
AS CA 

2 

0.82 

K1EE11 

0.82 

1.5 

0.82 

1.5 

3 

0.8 

1.46 

0.8 

1.46 

0.8 

1.46 

4 

0.77 

1.43 

0.77 

1.43 

0.77 

1.42 

6 

0.72 

1.31 

0.73 

1.33 

0.71 

1.29 

8 

0.69 

1.23 

0.69 

1.24 

0.69 

1.24 

12 

0.63 

IflEEll 

0.65 

1.0 

0.61 

0.99 

16 

0.55 

0.89 

0.54 

0.89 

0.55 

0.9 


Table 3: Run time relative to SY for the steady-state one dimensional problem with ^ points 
per processor; all parameters are as in Table 2 

Results for the non-homogeneous two-dimensional problem shown in Table 4 indicate 
similar performan ce. In thi s case, we have a non-trivial steady-state. The iterations are 

calculated until y * J - < 10~ 2 . Again the AS is much faster than the SY and the CA 

is at best as fast as the SY. 


P 

pts in 
blk 

Efficiency 
SY AS CA 

Rel. Run Time 
AS CA 

2 

8 x 16 

0.66 

1.0 

0.60 

0.59 

1.11 

4 

4 x 16 

0.38 

0.645 

0.35 

0.58 

1.09 

8 

2 x 16 

0.22 

0.38 

0.205 

0.56 

1.07 

16 

1 x 16 

0.11 

0.20 

0.10 

0.55 

1.03 


Table 4: Efficiency and run time relative to SY for the 2-dim. steady-state non-homogeneous 
problem; L = 256, Ax = Ay = 0.0667, j££j 7 + = 0-5; Calculated until \Z ^ 

< 10~ 2 

The efficiency and the accuracy of intermediate (non steady-state) values are shown 
in Table 5. In general, for relatively small times T, the asynchronous scheme provides 
the least accurate results. Therefore, the asynchronous scheme should be applied only to 
steady state problems. Note that unlike synchronous iterations that steadily converge, the 
convergence of asynchronous iterations may not be monotone, thus affecting the stopping 
criteria. The efficiency of these schemes, however, decreases as the number of processors 
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increases. Then, the overhead of initiating another processor becomes more significant than 
the work it actually performs. The CA scheme is a compromise of the two. It improves the 



Efficiency 
SY AS CA 

E 

SY AS CA 

e 

SY AS CA 

IB 

0.78 

0.96 

0.5 

0.00367 

0.9 

0.02 

0.000026 

0.007 

0.00008 

IB 

0.74 

0.94 

0.49 

0.00367 

0.9 

0.05 

0.000026 

0.007 

0.00023 

IB 

0.69 

0.91 

0.47 

0.00367 

0.9 

0.05 

0.000026 

0.007 


IB 

0.59 

0.82 

0.45 

0.00367 

0.9 

0.05 

0.000026 

0.007 

0.0002 

8 

0.5 

0.74 

0.42 

0.00367 

0.9 

0.04 

0.000026 

0.007 

0.00013 

12 

0.36 

0.56 

0.35 

0.00367 

0.9 

0.07 

0.000026 

0.007 

0.00018 

imi 

0.28 

0.42 

0.29 

0.00367 

0.9 

0.04 

0.000026 

0.007 

0.00012 


Table 5: Efficiency and accuracy for the one dimensional problem and with ^ grid points 
per processor; L = 48, Ai = 0.02128, = 0.5, T = 0.5 

accuracy lost in the asynchronous version, by doing some extra extrapolation calculations, 
at the cost of increasing the calculation time. Note that although CA is significantly more 
accurate than AS, it is still less accurate than the SY scheme, because of the lower order of 
its truncation error (0(Ax s ) instead of 0(Ax s 2 )). 

Two dimensional results are shown in Table 6. In this case, the efficiency of all schemes 
proves to be much better than the one dimensional case (in some cases, it is close to and 
even higher than one). The run time improvement for the AS and CA schemes relative to 
SY, is significant when the communication is significant relative to computation. With a 
small number of processors the synchronization penalty is small in comparison to the extra 
computation needed by the CA scheme (as compared to the AS and SY schemes). However, 
as the number of processors increases and there is less work per processor, the run time of 
the CA is close to that of AS. 

Although the intermediate time-level results of the CA scheme are significantly more 
accurate than those of the AS, they are still much less accurate than those of the SY scheme. 
This is due to the dimensional additivity of the truncation error which, as indicated earlier, is 
of lower order than that of the synchronous scheme. Hence, in multi-dimensional problems, 
one should either use a very fine grid, or else higher order schemes. 
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p 

pts in 

blk 

Efficiency 
SY AS CA 

E 

SY AS CA 

e 

SY AS CA 

B 

8 x 16 

0.95 

1.05 

0.51 

0.07 

0.99 

0.17 

0.0000036 

0.000051 

0.0000030 

ID! 

4 x 16 

0.79 

0.90 

0.45 

0.07 

0.99 

0.5 

0.0000036 

0.000051 

0.000018 

8 

2 x 16 

0.67 

0.81 

0.42 

0.07 

0.97 

0.16 

0.0000036 

0.000049 

0.0000028 

16 

1 x 16 

0.45 

0.60 

0.37 

0.07 

0.99 

0.17 

0.0000036 

0.000051 

0.0000030 


Table 6: Efficiency and accuracy for the 2-dim. problem; L = 256, Ax = Ay = 0.0667, + 

^ = 0.5, T = 0.5 

6 Summary 


This paper presents asynchronous (AS) and corrected-asynchronous (CA) finite differ- 
ence schemes (based on the Euler explicit scheme) for the multi-dimensional heat equation, 
to be implemented on MIMD multiprocessors. Although we consider only the heat equa- 
tion, our analysis can be easily modified and extended to other parabolic partial differential 
equations, and other finite difference schemes. Our schemes are analyzed and implemented 
on the shared-memory multi-user Sequent Balance machine. They are compared with the 
corresponding serial (SR) scheme and the parallel synchronous (SY) scheme. In general, the 
efficiency of the parallel schemes increases as more mesh points are assigned to each proces- 
sor, as the time-level increases and as the problem dimension increases. It is proved that 
the AS scheme converges to the solution of a differential equation other than the original 
one. However, it provides accurate steady-state results and its efficiency may reach 90% and 
over. AS may be almost twice as fast as the parallel synchronous (SY) scheme. It is proved 
that unlike the AS scheme, the CA scheme does converge to the solution of the original heat 
equation, under certain requirements. Its efficiency, however, is only about 50% in the best 
cases considered here. Nevertheless, CA offers more accurate results for the intermediate 
(non steady-state) time-levels. Yet, its accuracy is less than the SY scheme, especially in the 
multi-dimensional case, due to the loss of order in the truncation error in each spatial co- 
ordinate. Hence, for intermediate time-level results one should either use a very fine spatial 
net, or else use higher order schemes. 
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